Don't create Host instances with random host_id #623
Some tests are still failing, but I wanted to ask if the direction is good @dkropachev
@Lorak-mmk maybe you know why this test assumes that the new_host should be different?
I have no idea. Was this test passing until now and non-flaky? If so, then perhaps there is such logic somewhere.
Now that I think of it: I see that the driver uses the LBP to decide the order of hosts to connect to. See
Makes sense. Second question: this test assumes that both queries use the same host, since they use different instances of RoundRobinPolicy that start from the same host. But how can this be true if the starting position is randomized here: https://github.com/scylladb/python-driver/blob/master/cassandra/policies.py#L182
No idea. Perhaps
cassandra/policies.py
Outdated
| if not self.local_dc: | ||
| self.local_dc = dc | ||
| return HostDistance.LOCAL |
Should not be in this PR
@sylwiaszunejko, what is the reason for having it here?
+1, it is not obvious, nor explained anywhere.
@sylwiaszunejko, it looks like you reintroduced it in a recent push.
This is actually needed for any test to pass. Now distance is called before on_add/on_up in add_or_renew_pool, and we need local_dc to be non-null there. I agree it wasn't explained enough: if it is None, all Hosts are marked as ignored.
@Lorak-mmk @dkropachev
Ok, so my findings are as follows. Right now the flow is like this:
- we try to establish the cc
- get `system.local` and `system.peers` results
- call `on_add` on the cluster for every discovered host https://github.com/scylladb/python-driver/blob/master/cassandra/cluster.py#L2013
- it calls `distance` and `on_add` on the lbp (at this point populate has not been called yet, as we don't know any hosts at the beginning, so _endpoints on the dc aware policy are not set, and we cannot set local_dc in on_add) https://github.com/scylladb/python-driver/blob/master/cassandra/policies.py#L254-L263
- tests fail as all hosts are considered IGNORED (empty `local_dc`)
Before my change we would call populate with fake hosts (with proper endpoints but wrong host_ids), and _endpoints in the dc aware policy would be correctly assigned and usable in on_add when we discovered the proper hosts (with the right host_ids).
To solve this, I think we should omit the _endpoints logic (setting it to cluster.resolved_endpoints in populate, and checking whether the host endpoint is in _endpoints before setting local_dc to a default in on_up) and just set local_dc to host.datacenter regardless of whether the host is among the contact points provided to the cluster. WDYT?
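To make the failure mode in the flow above concrete, here is a minimal toy model. `ToyHost` and `ToyDCAwarePolicy` are hypothetical stand-ins written for this illustration, not the driver's classes, and the logic is deliberately simplified. It only shows why a policy that infers `local_dc` from known contact points reports every host as IGNORED when `populate` has not run before `on_add`.

```python
from enum import Enum


class HostDistance(Enum):
    LOCAL = 0
    REMOTE = 1
    IGNORED = -1


class ToyHost:
    """Hypothetical stand-in for a driver Host; only the fields used here."""

    def __init__(self, endpoint, datacenter):
        self.endpoint = endpoint
        self.datacenter = datacenter


class ToyDCAwarePolicy:
    """Simplified model of a DC-aware policy (an assumption, not the real code)."""

    def __init__(self, local_dc=None, used_hosts_per_remote_dc=0):
        self.local_dc = local_dc
        self.used_hosts_per_remote_dc = used_hosts_per_remote_dc
        self._endpoints = []  # filled by populate() from the contact points

    def populate(self, contact_points):
        self._endpoints = list(contact_points)

    def on_add(self, host):
        # local_dc is inferred only from hosts that are known contact points.
        if not self.local_dc and host.endpoint in self._endpoints:
            self.local_dc = host.datacenter

    def distance(self, host):
        if host.datacenter == self.local_dc:
            return HostDistance.LOCAL
        if self.used_hosts_per_remote_dc == 0:
            return HostDistance.IGNORED
        return HostDistance.REMOTE


host = ToyHost("10.0.0.1", "dc1")

# New flow: on_add runs before populate, so local_dc stays unset.
unseeded = ToyDCAwarePolicy()
unseeded.on_add(host)

# Old flow: populate ran first with the contact points, so on_add can infer the DC.
seeded = ToyDCAwarePolicy()
seeded.populate(["10.0.0.1"])
seeded.on_add(host)

print(unseeded.distance(host), seeded.distance(host))
```

In this toy, the unseeded policy reports the host as IGNORED while the seeded one reports it as LOCAL, which mirrors the test failures described in the comment.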
There could be a use case where users rely on a random dc/rack assignment that comes from DNS.
Say you have a single DNS name for the whole cluster; then you point the driver at it without specifying dc/rack.
Let's change _refresh_node_list_and_token_map to specifically find and process the row that matches the endpoint of the connection it is running on.
Then it will end up in distance, where the policy can learn the dc or rack from it.
Please don't forget to make the same changes to RackAwareRoundRobinPolicy.
When implementing RackAwareRoundRobinPolicy it was decided not to do implicit dc / rack #332
Specific comment: #332 (comment)
@dkropachev I am not sure I get your approach. Isn't querying system.local on that connection and processing host info from it enough? Or do you mean I should process it as the first host, before the peers result?
I thought we agreed that distance is not a good place to assign the default dc.
Do you agree with removing the _endpoints logic from the dc aware policy?
I pushed a new version with the local host at the beginning of the list of hosts to process.
This test was working because
In the previous approach (calling populate with one host) were the
You could then adjust the test, not remove it.
Please let me review before merging,
| try: | ||
| host = [live_hosts[self.host_index_to_use]] | ||
| if len(live_hosts) > self.host_index_to_use: | ||
| host = [live_hosts[self.host_index_to_use]] | ||
| except IndexError as e: | ||
| raise IndexError( | ||
| 'You specified an index larger than the number of hosts. Total hosts: {}. Index specified: {}'.format( | ||
| len(live_hosts), self.host_index_to_use | ||
| )) from e | ||
| return host |
Previously, an index error (happening if len(live_hosts) <= host_index_to_use) was caught, an error was printed, and then the exception was rethrown (presumably failing the test).
Now you have introduced an if which prevents the IndexError from happening at all.
- If this change really is desirable, the code handling IndexError should be removed, since it is dead.
- Please explain the reason for this change. Why should this condition now return an empty plan instead of raising an exception?
My bad, I should have checked that len(live_hosts) is not 0 here.
The condition may have changed, but I still don't understand why it's necessary. Why should this specific case return an empty plan instead of throwing an exception?
Now we initialize the lbp only after we learn all the hosts from the cc (unlike before, when the lbp was populated with values from the cluster config). While establishing the cc, I specifically handle the case where the lbp returns an empty query plan, and then we use the resolved endpoints.
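The fallback described here can be sketched with `itertools.chain`, following the expression quoted later in this review. The function name `candidate_endpoints` and the `ToyHost` tuple are hypothetical names used only for illustration; the driver's actual code may differ.

```python
from collections import namedtuple
from itertools import chain

# Hypothetical stand-in for a driver Host; only `endpoint` matters here.
ToyHost = namedtuple("ToyHost", ["endpoint"])


def candidate_endpoints(query_plan, resolved_contact_points):
    """Yield endpoints from the policy's query plan first, then the resolved
    contact points. With an empty plan (policy not yet initialized), only the
    contact points are yielded."""
    return chain((host.endpoint for host in query_plan), resolved_contact_points)


contact_points = ["10.0.0.1", "10.0.0.2"]

# First connection attempt: the policy knows no hosts, so the plan is empty.
first_attempt = list(candidate_endpoints([], contact_points))

# Reconnection: the policy is populated, so its hosts are tried first.
reconnect = list(candidate_endpoints([ToyHost("10.0.0.3")], contact_points))

print(first_attempt, reconnect)
```

Note that in this sketch an endpoint can appear twice when it is both in the plan and among the contact points; whether the real code deduplicates is not stated in this thread.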
| with pytest.raises((WriteTimeout, Unavailable)): | ||
| self.session.execute(query, timeout=None) | ||
| finally: | ||
| get_node(1).resume() | ||
|
|
||
| # Change the scales stats_name of the cluster2 | ||
| cluster2.metrics.set_stats_name('cluster2-metrics') | ||
|
|
||
| stats_cluster1 = self.cluster.metrics.get_stats() | ||
| stats_cluster2 = cluster2.metrics.get_stats() | ||
|
|
||
| # Test direct access to stats | ||
| assert 1 == self.cluster.metrics.stats.write_timeouts | ||
| assert (1 == self.cluster.metrics.stats.write_timeouts or 1 == self.cluster.metrics.stats.unavailables) | ||
| assert 0 == cluster2.metrics.stats.write_timeouts |
Why did the exception thrown change?
@Lorak-mmk I haven't yet figured out why in
Lorak-mmk
left a comment
The commit "Don't create Host instances with random host_id" should be the last one, right? Without the test fixes introduced in the subsequent commit, this commit can't pass tests, I think.
| self.conviction_policy = conviction_policy_factory(self) | ||
| if not host_id: | ||
| host_id = uuid.uuid4() | ||
| raise ValueError("host_id may not be None") | ||
| self.host_id = host_id |
Commit: "Don't create Host instances with random host_id"
The change here is the one that the commit message explains. Perhaps the chain((host.endpoint for host in lbp.make_query_plan()), self._cluster.endpoints_resolved) line is also explained. The other changes are not explained, and are not at all obvious to me.
When writing commits, please assume that a reader won't be as familiar with the relevant code as you are. This is almost always true: even if the reviewer is an active maintainer, there is a high chance they have not worked with this specific area recently.
… starting point
The `test_profile_lb_swap` test logic assumed that `populate` was called before the control connection (cc) was created, meaning only the contact points from the cluster configuration were known (a single host). Because of that, the starting point was not random. This commit updates the test to reflect the new behavior, where `populate` is called on the load-balancing policy after the control connection is created. This allows the policy to be updated with all known hosts and ensures the starting point is properly randomized.
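As an illustration of why the starting point stops being deterministic once `populate` sees more than one host, here is a toy round-robin policy. `ToyRoundRobin` is a hypothetical simplification written for this sketch, with an injectable random generator so the behavior can be checked; it is not the driver's `RoundRobinPolicy`.

```python
import random


class ToyRoundRobin:
    """Simplified round-robin policy with a randomized starting position."""

    def __init__(self, rng=None):
        self._rng = rng or random.Random()
        self._hosts = []
        self._position = 0

    def populate(self, hosts):
        self._hosts = list(hosts)
        # With a single host there is nothing to randomize.
        if len(self._hosts) > 1:
            self._position = self._rng.randrange(len(self._hosts))

    def make_query_plan(self):
        hosts = self._hosts
        start = self._position
        self._position += 1
        count = len(hosts)
        return [hosts[(start + i) % count] for i in range(count)]


# Old flow: only the single contact point is known, so every policy
# instance starts from the same host.
single_a, single_b = ToyRoundRobin(), ToyRoundRobin()
single_a.populate(["h1"])
single_b.populate(["h1"])

# New flow: all hosts are known at populate time, so separate instances
# may start from different hosts.
starts = set()
for seed in range(50):
    policy = ToyRoundRobin(random.Random(seed))
    policy.populate(["h1", "h2", "h3"])
    starts.add(policy.make_query_plan()[0])

print(single_a.make_query_plan(), single_b.make_query_plan(), len(starts) > 1)
```

A test that assumed two fresh policy instances begin at the same host therefore only holds in the single-host case.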
Previously, the driver relied on the load-balancing policy (LBP) to determine the order of hosts to connect to. Since the default LBP is Round Robin, each reconnection would start from a different host. After removing fake hosts with random IDs at startup, this behavior changed. When the LBP is not yet initialized, the driver now uses the endpoints provided by the control connection (CC), so there is no guarantee that different hosts will be selected on reconnection. This change updates the test logic to first establish a connection and initialize the LBP, and only then verify that two subsequent reconnections land on different hosts in a healthy cluster.
Only compare host endpoints, not whole Host instances, as we don't know the host IDs.
In the DC-aware LBP, when local_dc is not provided we set it in on_add, and it needs to be initialized for distance to give proper results.
Previously, we used endpoints provided to the cluster to create Host instances with random host_ids in order to populate the LBP before the ControlConnection was established. This logic led to creating many connections that were opened and then quickly closed, because once we learned the correct host_ids from system.peers, we removed the old Hosts with random IDs and created new ones with the proper host_ids. This commit introduces a new approach. To establish the ControlConnection, we now use only the resolved contact points from the cluster configuration. Only after a successful connection do we populate Host information in the LBP. If the LBP is already initialized during ControlConnection reconnection, we reuse the existing values.
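A minimal sketch of the approach this commit message describes, under stated assumptions: `ToyHost` and `hosts_from_rows` are hypothetical names, the rows are plain dicts keyed by `rpc_address` and `host_id`, and the real driver does considerably more. It shows hosts being built only from `system.local` and `system.peers` data, local node first, with a missing `host_id` rejected rather than replaced by a random UUID.

```python
import uuid


class ToyHost:
    """Hypothetical stand-in for a driver Host that refuses a missing host_id."""

    def __init__(self, endpoint, host_id):
        if not host_id:
            raise ValueError("host_id may not be None")
        self.endpoint = endpoint
        self.host_id = host_id


def hosts_from_rows(local_row, peer_rows):
    """Build hosts from the system.local row followed by system.peers rows,
    so the node behind the control connection is processed first."""
    return [
        ToyHost(row["rpc_address"], row["host_id"])
        for row in [local_row, *peer_rows]
    ]


local_row = {"rpc_address": "10.0.0.1", "host_id": uuid.uuid4()}
peer_rows = [
    {"rpc_address": "10.0.0.2", "host_id": uuid.uuid4()},
    {"rpc_address": "10.0.0.3", "host_id": uuid.uuid4()},
]

hosts = hosts_from_rows(local_row, peer_rows)
print([host.endpoint for host in hosts])

try:
    ToyHost("10.0.0.4", None)
except ValueError as exc:
    print("rejected:", exc)
```

Because every host carries the ID reported by the cluster from the start, there is no later step that discards placeholder hosts and reopens their connections.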
This PR fixes inefficiencies in the host initialization mechanism when bootstrapping a cluster.
Previously, the driver created `Host` instances with connections from the contact points provided in the cluster configuration using random host IDs. After establishing the control connection and reading from `system.peers`, these initial `Host` instances were discarded and replaced with new ones created using the correct host metadata. This approach resulted in unnecessary creation and teardown of multiple connections.
Changes
- `system.local` and `system.peers`.
- `Host` instances are created with the correct `host_id` values.
Fixes: #619